[From: Cream of the Crop 26.iso / editor / dedupe12.zip / TECH.DOC (Text File, 1997-05-12, 4KB, 60 lines)]
"DeDupe" Technical Information:
Before I begin, "Source" actually means the Reference. "Source Block", "Block2",
"Source Line" (the Reference Line), and "Line2" (the Line that the Reference
Line is Compared To) are All part of the Same File, which is Opened Twice:
once as the "Source" File, and again as "File2".
I thought the Best Way to Remove Duplicate Lines Located Anywhere would be
to "Read" (Load) the File by Blocks. "Source" Line 1 is Compared to Line 2,
then to Line 3, ... then to the Last Line of the Current Block. Next,
"Source" Line 2 is Compared to Line 3, then to Line 4, ... then to the Last
Line of the Current Block. Next, "Source" Line 3, etc. The "Source" Line is
Always Lower than "Line2" (the Line Compared To). When all the "Source"
Lines of the Current "Source" Block are Finished, Load the Next "Block2" and
Reset "Source" Block back to Line 1 again. Now Compare "Source" Line 1 to
the first Line in the next "Block2" (the next Block of the File). When
"Block2" is the Last Block of the File ("File2"), and all the "Source" Block
Lines have been Compared to All the Lines in the Last "Block2", ReStart
"File2" again and Advance "Source" Block to the Next Block of the File.
Note: Until now, the "Source" Block has been the First Block of the File.
Advance the ReStarted "File2" Line by Line until "Line2" is Past the Current
"Source" Line (which is Now in the Next Block of the File), then resume
Comparing Lines.
    This Process Ends when the "Source" Block reaches the Last Block of the
File and All "Source" Lines have been Compared, except the Last Line (a Line
is Never Compared to Itself, and No Lines Follow it).
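The original program was a DOS executable, not Python, and its Block size and variable names are not given; the sketch below is my reconstruction of the Block-by-Block Compare described above, with an in-memory list standing in for the File and a Boolean list standing in for the "Mark" Segment.

```python
def mark_duplicates(lines, block_size=4):
    """Compare every Line to every later Line, one "Source" Block
    against one "Block2" at a time, marking the later copy of each
    duplicate. The "Source" Line index is always lower than "Line2"."""
    n = len(lines)
    marks = [False] * n                       # one "bit" per Line
    for src_start in range(0, n, block_size):            # "Source" Block
        src_block = lines[src_start:src_start + block_size]
        for b2_start in range(src_start, n, block_size): # "Block2"
            block2 = lines[b2_start:b2_start + block_size]
            for i, src_line in enumerate(src_block):
                src_idx = src_start + i
                for j, line2 in enumerate(block2):
                    idx2 = b2_start + j
                    if idx2 <= src_idx:      # "Source" is always lower
                        continue
                    if line2 == src_line:
                        marks[idx2] = True   # mark the later duplicate
    return marks

# With block_size=2, five Lines are scanned two Blocks at a time:
print(mark_duplicates(["a", "b", "a", "c", "b"], block_size=2))
# → [False, False, True, False, True]
```

Only the later copy of each duplicate is ever marked, so the first occurrence of every Line survives, which matches the behavior the rest of this document describes.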
But wait, it's Not Over yet! Along the way, the Lines Compared that
Match (Duplicate) are Marked (Setting a Bit) in a "Mark" Segment in Computer
Memory. One final Pass Reads the entire File again, using a Line Counter to
"Index" each Line's Related Bit in the "Mark" Segment, in order to determine
whether that Line is a Duplicate or Not.
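The "Mark" Segment packs one Bit per Line. A minimal sketch of that bookkeeping and the final Pass, assuming a plain byte buffer with 8 Bits per byte (the original segment-register details are DOS-specific and not shown here):

```python
def set_mark(mark_seg, line_no):
    """Set the Bit for a given Line number (0-based) in the "Mark" Segment."""
    mark_seg[line_no // 8] |= 1 << (line_no % 8)

def is_marked(mark_seg, line_no):
    """Test the Bit for a given Line number."""
    return bool(mark_seg[line_no // 8] & (1 << (line_no % 8)))

def final_pass(lines, mark_seg):
    """Re-read the File with a Line Counter, keeping only un-Marked Lines."""
    return [ln for i, ln in enumerate(lines) if not is_marked(mark_seg, i)]

lines = ["a", "b", "a", "c", "b"]
mark_seg = bytearray((len(lines) + 7) // 8)  # one Bit per Line, rounded up
set_mark(mark_seg, 2)                        # Line 3 is a duplicate of Line 1
set_mark(mark_seg, 4)                        # Line 5 is a duplicate of Line 2
print(final_pass(lines, mark_seg))
# → ['a', 'b', 'c']
```

One Bit per Line keeps the Segment tiny: a 30,000-Line File needs only 3,750 bytes of "Mark" Memory.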
This Process made the Project Complex, but the Alternative was to Limit
the Size of the File from Small to Medium size (whatever can Fit into
available Memory), or to Load the First Line and Compare it to Every Line
(from Line 2) to the End of the File; Next, Line 2 to Every Line (from
Line 3) to the End of the File; and so on. This would have made my "Project" Easier,
but for a 30,000 Line File, the File would have to be Read (Loaded) 30,000
times, which is a good way to Shorten the Life of your Hard Drive.
WHY IT TAKES "TIME":
I did Not know, when I started this Project, how many Line Compares would
take place. Test Procedures added to the Program, while it was under
Development, included a "Cycle Counter", which Counted every time a Line
Comparison took Place. I never imagined the Astronomical Number of Times
Lines would be Compared in a Large File. A 301 Line File (Small File)
required a Total of 45,150 Compare Cycles. If you Add just 10 Lines to that
File, it Jumps to 48,205 Cycles, an Increase of 3,055 Cycles. If you Add
another 10 Lines (Now 321 Lines), the Total Cycles Increased to 51,360, an
additional increase of 3,155. Notice the differences also Increased (Non-
Linear). Now Imagine a 20,000 Line File! This is the Reason it took about
12 minutes on my 75MHz Pentium PC. Note: Two different 20,000 Line Files
will Not take the Same amount of Time. It depends on the number of
Characters in the Lines, and how far into each Line the Compare gets before a
Mis-Match occurs (at which point it Moves On to the next Line).
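The Cycle counts above follow the triangular-number formula n*(n-1)/2, since each Line is Compared once to every later Line. The document never states the formula, so this is my reconstruction, but it reproduces every figure given:

```python
def compare_cycles(n):
    """Total Line-Compare Cycles for an n-Line File: each of the n Lines
    is Compared to every later Line exactly once, i.e. n*(n-1)/2."""
    return n * (n - 1) // 2

print(compare_cycles(301))                        # → 45150
print(compare_cycles(311))                        # → 48205
print(compare_cycles(321))                        # → 51360
print(compare_cycles(311) - compare_cycles(301))  # → 3055
print(compare_cycles(20000))                      # → 199990000
```

A 20,000-Line File works out to nearly 200 Million Compare Cycles, which is why the run time grows so much faster than the File size.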